Explanation about the Dataset
This dataset, taken from Kaggle, contains information on various health and economic factors for different countries, with the main goal of predicting Life Expectancy. The dataset includes the following columns:
- Country: The name of the country.
- Year: The year when the data was recorded.
- Status: Whether the country is developed or developing.
- Life Expectancy: The average life expectancy in years. This is the target variable.
- Adult Mortality: The probability of dying between the ages of 15 and 60, per 1,000 population.
- Infant Deaths: The number of infant deaths per 1,000 population.
- Alcohol: Recorded alcohol consumption per person aged 15 and over, in litres of pure alcohol.
- Percentage Expenditure: Health expenditure as a percentage of GDP per capita.
- Hepatitis B: Hepatitis B immunization coverage among 1-year-olds (%).
- Measles: The number of reported measles cases per 1,000 population.
- BMI: The average Body Mass Index (BMI) of the population.
- Under-Five Deaths: The number of deaths of children under five per 1,000 population.
- Polio: Polio immunization coverage among 1-year-olds (%).
- Total Expenditure: Government expenditure on health as a percentage of total government expenditure.
- Diphtheria: Diphtheria (DTP3) immunization coverage among 1-year-olds (%).
- HIV/AIDS: Deaths from HIV/AIDS among children aged 0-4, per 1,000 live births.
- GDP: Gross Domestic Product per person, a measure of a country's wealth.
- Population: The total population of the country.
- Thinness (1-19 years): The prevalence of thinness (%) among children and adolescents. Despite the column name, the dataset's documentation gives the age range as 10-19.
- Thinness (5-9 years): The prevalence of thinness (%) among children aged 5-9.
- Income Composition of Resources (ICOR): The Human Development Index in terms of income composition of resources, an index ranging from 0 to 1.
- Schooling: The average number of years of schooling for the population.
These features help us understand what factors influence life expectancy in different countries. By analyzing this data, we can identify which aspects—like healthcare, education, and economic development—are most important for improving life expectancy.
How Life Expectancy is Calculated for a Country
Life expectancy is the average number of years a newborn would live if the age-specific death rates observed in a given year stayed the same throughout its life. It reflects factors like health, lifestyle, and the country's healthcare system. Here's how it is generally calculated:
Collect Mortality Data:
- The first step is to gather data on how many people die at each age in a population. This data usually comes from national health reports or government statistics.
Calculating Life Expectancy:
- Life expectancy is derived from a life table: the observed death rates are applied to a hypothetical group of newborns, the years lived in each age group are added up, and the total is divided by the number of people in the group. The result is the average number of years a person is expected to live.
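As a rough illustration of the calculation described above, the sketch below builds a very simplified life table. The function name, the toy death probabilities, and the assumption that people who die within an age band live half of it are illustrative choices only; they are not taken from the dataset or from any official methodology.

```python
def life_expectancy_at_birth(death_probs, interval_years=1.0):
    """Estimate life expectancy at birth from age-specific death probabilities.

    death_probs[i] is the probability of dying during age band i, given
    survival to the start of that band. The last entry should be 1.0 so
    that the table is closed (nobody survives the final band).
    """
    alive = 1.0          # share of the hypothetical birth cohort still alive
    person_years = 0.0   # years lived so far, per person born
    for q in death_probs:
        deaths = alive * q
        survivors = alive - deaths
        # Survivors live the whole band; those who die are assumed to
        # live half of it on average (a simplifying assumption).
        person_years += survivors * interval_years + deaths * interval_years / 2
        alive = survivors
    return person_years


# Hypothetical toy population described by three 30-year age bands.
toy_probs = [0.10, 0.30, 1.00]
e0 = life_expectancy_at_birth(toy_probs, interval_years=30.0)
print(f"Life expectancy at birth: {e0:.1f} years")
```

Real life tables use much narrower age bands and more careful assumptions about when deaths occur within each band.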
Factors Affecting Life Expectancy:
- Healthcare: Access to medical services increases life expectancy by reducing deaths from disease.
- Socioeconomic Factors: Wealthier countries with better nutrition, sanitation, and healthcare tend to have higher life expectancy.
- Lifestyle Choices: Factors like diet, exercise, and smoking impact life expectancy. Healthier lifestyles lead to longer life expectancy.
- Environment: Clean air and water, and good living conditions also play a role.
- Government Policies: Public health programs and government investment in infrastructure help improve life expectancy.
Global Differences:
- Life expectancy can vary greatly between countries. For example, richer countries may have life expectancies over 80 years, while some poorer countries may have much lower life expectancy due to health crises or lack of resources.
Life expectancy is a common measure of a country's overall health and development, and helps policymakers understand how different factors (like healthcare, lifestyle, or poverty) influence public health.
print(data.columns)
Index(['Country', 'Year', 'Status', 'Life expectancy', 'Adult Mortality',
'infant deaths', 'Alcohol', 'percentage expenditure', 'Hepatitis B',
'Measles', 'BMI', 'under-five deaths', 'Polio', 'Total expenditure',
'Diphtheria', 'HIV/AIDS', 'GDP', 'Population', 'thinness 1-19 years',
'thinness 5-9 years', 'Income composition of resources', 'Schooling'],
dtype='object')
print(data.head())
       Country      Year    Status  Life expectancy  Adult Mortality  \
0  Afghanistan  1.621762 -0.459399        -0.443691         0.790238
1  Afghanistan  1.404986 -0.459399        -0.979279         0.854614
2  Afghanistan  1.188210 -0.459399        -0.979279         0.830473
3  Afghanistan  0.971434 -0.459399        -1.021286         0.862660
4  Afghanistan  0.754658 -0.459399        -1.052791         0.886801

   infant deaths   Alcohol  percentage expenditure  Hepatitis B   Measles  \
0       0.268824 -1.133571               -0.335570    -0.635971 -0.110384
1       0.285786 -1.133571               -0.334441    -0.755661 -0.168124
2       0.302749 -1.133571               -0.334594    -0.675868 -0.173531
3       0.328193 -1.133571               -0.332096    -0.556178  0.032045
4       0.345155 -1.133571               -0.367862    -0.516281  0.051757

   ...     Polio  Total expenditure  Diphtheria  HIV/AIDS       GDP  \
0  ... -3.268019           0.889486   -0.730578 -0.323445 -0.483546
1  ... -1.048077           0.897493   -0.857092 -0.323445 -0.481553
2  ... -0.877312           0.877476   -0.772749 -0.323445 -0.480218
3  ... -0.663856           1.033609   -0.646235 -0.323445 -0.477539
4  ... -0.621165           0.773387   -0.604064 -0.323445 -0.520044

   Population  thinness 1-19 years  thinness 5-9 years  \
0    0.343993             2.796805            2.757185
1   -0.203706             2.864687            2.801550
2    0.311126             2.909942            2.845914
3   -0.148469             2.955197            2.912461
4   -0.160246             3.023079            2.956826

   Income composition of resources  Schooling
0                        -0.704483  -0.563614
1                        -0.718710  -0.593391
2                        -0.747164  -0.623168
3                        -0.780360  -0.652944
4                        -0.823042  -0.742275

[5 rows x 22 columns]
Explanation of the Picture:
The graph shows the number of records classified as Developed and Underdeveloped based on the Status column in the dataset. Each record is one country in one year, so a country is counted once for every year it appears.
- Underdeveloped Countries: 2,426 records are marked as Underdeveloped.
- Developed Countries: 512 records are marked as Developed.
Interpretation:
This chart shows that most records in the dataset (about 83%) belong to Underdeveloped countries, while a much smaller share (about 17%) belong to Developed ones. This reflects the global distribution where fewer countries are fully developed in terms of healthcare, economy, and social indicators. This classification can help identify trends related to Life expectancy, GDP, and other important factors.
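Because the dataset holds one row per country and year, counting the values of Status counts records, not countries. The sketch below shows the difference on a few made-up rows; the toy data and the status labels are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one row per country and year.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "Year":    [2013, 2014, 2015, 2014, 2015, 2015],
    "Status":  ["Developing"] * 3 + ["Developed"] * 2 + ["Developing"],
})

# Number of rows (country-year records) per status.
records_per_status = toy["Status"].value_counts()

# Number of distinct countries per status.
countries_per_status = toy.groupby("Status")["Country"].nunique()

print(records_per_status)
print(countries_per_status)
```

On the real data, the second count is the one that answers "how many countries are developed?".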
Comparison of Average Life Expectancy: Top 10 Countries with Lowest vs. Highest Life Expectancy
Interpretation of Life Expectancy Data:
The table shows the 10 countries with the lowest and the 10 countries with the highest average life expectancy in the available data.
Top 10 Countries with Lowest Life Expectancy: The countries with the lowest life expectancy include Sierra Leone, Central African Republic, and Lesotho. Their values are negative. Life expectancy in years cannot be negative, and the data.head() output above shows that the numeric columns, including Life expectancy, hold scaled values centered near zero rather than raw years. The most likely reading is therefore that a negative value marks a country below the dataset average, not a data-collection anomaly. Converting the values back to years would make the table easier to interpret.
Top 10 Countries with Highest Life Expectancy: On the other hand, countries like Japan, Sweden, and Iceland show the highest life expectancy values. These nations have consistently high life expectancy, likely due to better healthcare systems, quality of life, and overall economic conditions. Japan, for example, has one of the longest life expectancies globally, reflecting its excellent healthcare, diet, and lifestyle.
This data highlights the disparities in life expectancy across different regions, suggesting that countries with robust healthcare systems and stable economic conditions tend to have higher life expectancies, while those with poorer health infrastructure and other challenges face lower life expectancies. However, the values should be reported in years, or clearly labeled as scaled, before conclusions are drawn from their magnitude.
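One possible explanation for negative life expectancy values is that the column was standardized (z-scored) before the ranking was computed, which the scaled numbers in the data.head() output suggest. The sketch below uses made-up values to show that standardizing does not change the ranking and that the values can be converted back to years. The toy data and the use of a plain mean and standard deviation are assumptions; the original notebook's scaling step is not shown.

```python
import pandas as pd

# Hypothetical raw values in years; the real dataset is not loaded here.
raw = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C", "C"],
    "Life expectancy": [52.0, 54.0, 71.0, 73.0, 82.0, 84.0],
})

# Standardize the target column: subtract the mean, divide by the std.
mean = raw["Life expectancy"].mean()
std = raw["Life expectancy"].std(ddof=0)
scaled = raw.copy()
scaled["Life expectancy"] = (raw["Life expectancy"] - mean) / std

# Average per country, then rank. On real data, use 10 instead of 1.
by_country = scaled.groupby("Country")["Life expectancy"].mean()
lowest = by_country.nsmallest(1)
highest = by_country.nlargest(1)

# Countries below the overall mean get negative scores.
# Undo the scaling to report the same ranking in years.
lowest_years = lowest * std + mean
highest_years = highest * std + mean

print(lowest, lowest_years, highest_years, sep="\n")
```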
Interpretation of GDP vs Life Expectancy Scatter Plot:
The scatter plot indicates a positive relationship between Life Expectancy and GDP: countries with a higher GDP tend to have a higher average life expectancy.
Developed nations are clustered in the higher GDP range and generally have higher life expectancy values. A plausible explanation is that wealthier countries can provide better healthcare, education, and living standards, which contribute to longer life expectancy.
Underdeveloped nations, on the other hand, are situated in the lower GDP range and typically exhibit lower life expectancy values. This reflects the challenges faced by poorer countries, including limited access to healthcare, resources, and infrastructure.
In summary, the graph shows that developed nations tend to have higher life expectancy than underdeveloped nations. The plot shows an association rather than proof of cause, but stronger economic conditions are a likely contributor.
This table shows the correlation values between various factors and life expectancy. A correlation value measures the strength and direction of a relationship between two variables.
A positive value (closer to +1) means that as one variable increases, the other also tends to increase. For example, Schooling has a strong positive correlation of 0.77 with life expectancy, which suggests that higher levels of education are associated with longer life expectancy.
On the other hand, a negative correlation (closer to -1) means that as one variable increases, the other tends to decrease. For instance, Adult Mortality has a negative correlation of -0.70, indicating that higher mortality rates are associated with shorter life expectancy.
Variables like GDP (0.44) and Status (0.48) have moderate positive correlations, suggesting that wealthier and more developed countries tend to have higher life expectancies. Other factors, such as Population (-0.03), show almost no correlation, meaning that changes in population size don’t have a strong relationship with life expectancy.
In summary, these correlations help us understand which factors are most closely related to life expectancy, and whether those relationships are positive or negative. This insight can guide policy and research on improving public health outcomes.
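A correlation table like the one described above can be produced along the following lines. This is a minimal sketch on made-up numbers; only the column names come from the dataset, and the resulting values are not the ones reported in the text.

```python
import pandas as pd

# Hypothetical toy values; column names follow the dataset.
toy = pd.DataFrame({
    "Life expectancy": [55.0, 60.0, 65.0, 72.0, 80.0],
    "Schooling":       [6.0, 8.0, 9.5, 12.0, 15.0],
    "Adult Mortality": [400.0, 320.0, 250.0, 150.0, 70.0],
    "Population":      [3.0, 9.0, 1.0, 7.0, 4.0],
})

# Pearson correlation of every feature with the target,
# ordered by absolute strength (strongest first).
target = "Life expectancy"
corr_with_target = (
    toy.drop(columns=target)
       .corrwith(toy[target])
       .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative relationships, such as Adult Mortality, near the top instead of at the bottom of the list.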
Feature Engineering
Why Model Selection is Important
Model selection is a critical step in machine learning because the choice of model can significantly affect the quality of the predictions. A well-chosen model accurately captures the patterns in the data, while a poorly chosen one may overfit or underfit and give inaccurate predictions.
There are various types of models available, each with strengths and weaknesses depending on the nature of the data, the complexity of the relationships, and the task at hand. Hence, it is essential to try different models and evaluate their performance to choose the most suitable one.
Why R² Test was Performed on Different Models
The R² (R-squared) score is a commonly used metric to assess the performance of regression models. It measures the proportion of the variance in the dependent variable (in this case, life expectancy) that is predictable from the independent variables (features). The best possible score is 1, and on held-out data the score can drop below 0:
- A value of 1 means the model perfectly explains the variance of the target variable.
- A value of 0 means the model does no better than always predicting the mean of the target.
- A negative value means the model does worse than always predicting the mean.
Since we are predicting life expectancy, which is a continuous variable, evaluating models based on their R² score is essential to understand how well the model can explain the variability in life expectancy based on the selected features.
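To make the metric concrete, here is a small sketch that computes R² directly from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r2_manual` and the toy values are made up for illustration. Note that the last case gives a score below 0, which happens when predictions are worse than simply using the mean.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, computed from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot


# Hypothetical life expectancy values in years.
y = [60.0, 65.0, 70.0, 75.0, 80.0]

perfect = r2_manual(y, y)                                  # exact predictions
mean_only = r2_manual(y, [70.0] * 5)                       # always predict the mean
reversed_fit = r2_manual(y, [80.0, 75.0, 70.0, 65.0, 60.0])  # worse than the mean

print(perfect, mean_only, reversed_fit)
```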
Models Evaluated:
RandomForestRegressor: A random forest is an ensemble learning method that averages the predictions of many decision trees to improve predictive accuracy and control overfitting. It can capture both linear and non-linear relationships, which is why it was tested here.
XGBRegressor: XGBoost is a powerful gradient boosting model that optimizes for speed and accuracy. It is particularly good at handling large datasets and complex relationships.
RidgeCV: Ridge regression is a form of linear regression that includes an L2 regularization term. It helps prevent overfitting by penalizing large coefficients and is particularly useful when dealing with multicollinearity.
LinearRegression: Linear regression is a simple and widely-used model for regression tasks. It assumes a linear relationship between the independent variables and the target variable, making it easy to interpret.
Ridge: Similar to RidgeCV, this model uses regularization to prevent overfitting, but it lacks the cross-validation built into RidgeCV for selecting the optimal regularization parameter.
GradientBoostingRegressor: This is another ensemble method that builds trees sequentially, with each tree trying to correct the errors of the previous one. It's very effective for capturing complex, non-linear patterns.
By performing the R² test on each of these models, we aim to determine which model best explains the variance in life expectancy and provides the most accurate predictions. The R² score helps compare the effectiveness of each model in terms of how well they fit the data.
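The difference between Ridge and RidgeCV mentioned above can be sketched as follows, using scikit-learn. The synthetic data and the candidate alpha values are assumptions chosen for illustration; the settings used in the original notebook are not shown.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

# Synthetic regression data: 5 features, 2 of them irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# Ridge: the regularization strength alpha is fixed by the user.
ridge = Ridge(alpha=1.0).fit(X, y)

# RidgeCV: alpha is picked from a list of candidates by cross-validation.
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_cv = RidgeCV(alphas=candidate_alphas).fit(X, y)

print("Ridge coefficients:  ", ridge.coef_.round(3))
print("RidgeCV chosen alpha:", ridge_cv.alpha_)
```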
Model Performance:
                       Model  R² Score
0      RandomForestRegressor  0.969662
2               XGBRegressor  0.967337
5                    RidgeCV  0.962525
3           LinearRegression  0.961761
4                      Ridge  0.954758
1  GradientBoostingRegressor  0.952952
The best model is RandomForestRegressor with an R² score of 0.9697
The code aims to build and evaluate multiple regression models to predict Life Expectancy using a given dataset. It begins by loading the data into a DataFrame and handling any missing values by filling them with the column mean. To prepare the data for machine learning models, it encodes categorical variables using one-hot encoding, ensuring all columns are numerical. The target variable (Life expectancy) is separated from the features, followed by splitting the data into training (80%) and testing (20%) sets. Six regression models are then initialized: RandomForestRegressor, GradientBoostingRegressor, XGBRegressor, LinearRegression, Ridge, and RidgeCV. Each model is trained on the training set and evaluated using the R² (coefficient of determination) score on the test set, which indicates how well the model explains the variability of the target variable. The performance of each model is stored in a list, which is then converted into a DataFrame for easy comparison of scores. Finally, the models' performance is printed, showing which model best predicts life expectancy based on the provided features.
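The steps described above can be sketched roughly as follows. This is not the notebook's original code: the synthetic stand-in data, the random seeds, the default hyperparameters, and the optional handling of XGBRegressor are all assumptions made so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

try:  # xgboost is optional here; the model is skipped if it is missing
    from xgboost import XGBRegressor
except ImportError:
    XGBRegressor = None

# Synthetic stand-in for the dataset (hypothetical values).
rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "Status": rng.choice(["Developed", "Developing"], size=n),
    "Schooling": rng.normal(12.0, 3.0, size=n),
    "Adult Mortality": rng.normal(160.0, 60.0, size=n),
})
data["Life expectancy"] = (
    55.0 + 1.5 * data["Schooling"] - 0.04 * data["Adult Mortality"]
    + rng.normal(0.0, 1.0, size=n)
)
data.loc[::25, "Schooling"] = np.nan  # inject a few missing values

# 1. Fill missing numeric values with the column mean.
numeric_cols = data.select_dtypes(include="number").columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# 2. One-hot encode categorical columns so every column is numeric.
data = pd.get_dummies(data, drop_first=True, dtype=float)

# 3. Separate the target from the features and split 80/20.
X = data.drop(columns="Life expectancy")
y = data["Life expectancy"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Initialize the candidate models.
models = {
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "RidgeCV": RidgeCV(),
}
if XGBRegressor is not None:
    models["XGBRegressor"] = XGBRegressor(random_state=42)

# 5. Train each model and score it with R² on the test set.
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    score = r2_score(y_test, model.predict(X_test))
    results.append({"Model": name, "R² Score": score})

# 6. Collect the scores in a DataFrame for comparison.
results_df = pd.DataFrame(results).sort_values("R² Score", ascending=False)
print(results_df)
best = results_df.iloc[0]
print(f"The best model is {best['Model']} with an R² score of {best['R² Score']:.4f}")
```

Because the stand-in data is generated from a linear formula, the ranking of models here says nothing about their ranking on the real dataset.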
From the table, we can see that the RandomForestRegressor achieved the highest R² score of 0.969662, meaning it performs the best in predicting life expectancy.
Conclusions
In this project, we developed a machine learning model to predict Life Expectancy from the dataset's features. The best model, a Random Forest Regressor, achieved an R² score of 0.9697 on the test set, meaning it explains about 97% of the variance in life expectancy.
From our analysis, we found that HIV/AIDS is the most important feature for predicting Life Expectancy, contributing the most to the model's performance. This was followed by Adult Mortality and Schooling, which also showed significant importance in explaining variations in Life Expectancy. Other key features such as BMI, thinness (5-9 years), and Alcohol also contributed, but to a lesser extent.
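A feature ranking of this kind can be read from a fitted random forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which HIV/AIDS is deliberately given the largest effect; the numbers and the forest settings are assumptions and do not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (hypothetical values); column names follow the dataset.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "HIV/AIDS": rng.exponential(1.0, size=n),
    "Adult Mortality": rng.normal(160.0, 60.0, size=n),
    "Population": rng.normal(0.0, 1.0, size=n),
})
# Target built so that HIV/AIDS matters most and Population not at all.
y = (
    75.0 - 5.0 * X["HIV/AIDS"] - 0.02 * X["Adult Mortality"]
    + rng.normal(0.0, 0.5, size=n)
)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances describe how much the model relies on a feature, not whether that feature causes changes in life expectancy.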
Our model also highlighted the relationships between some features and Life Expectancy. For instance, HIV/AIDS and Adult Mortality have a strong negative correlation with Life Expectancy, as expected, because higher mortality rates typically lead to lower life expectancy. On the other hand, features such as Schooling and GDP have a positive relationship with Life Expectancy, which can be attributed to the fact that more educated and wealthier nations tend to have higher life expectancies.
Interestingly, we observed that Alcohol consumption is positively correlated with Life Expectancy in some regions, but this relationship is likely confounded by the level of development: countries with higher alcohol consumption often have better healthcare systems and longer life expectancies. The positive relationship becomes weaker or even negative when we account for other factors like ICOR Schooling (PCA).
Overall, the insights from this project underline the importance of factors such as healthcare, education, and socio-economic development in determining life expectancy across different countries. By understanding the feature importances, we can better target interventions to improve life expectancy in regions that are lagging behind.
References
- Life Expectancy calculation: Our World in Data
- Model Selection: Scholar hat